Improving Distributional Similarity with Lessons Learned from Word Embeddings

Abstract

A recent study by Baroni et al. (2014) shows that new embedding methods consistently outperform traditional count-based methods by a non-trivial margin on many similarity-oriented tasks, while the analysis by Levy and Goldberg shows that word2vec's SGNS is implicitly factorizing a word-context PMI matrix. This paper reveals that much of the performance gain of word embeddings is due to certain system design choices and hyperparameter settings, rather than the embedding algorithms themselves. Furthermore, it shows that these modifications can be transferred to traditional distributional models, yielding similar gains.

Background

Four word representation methods are considered:

  1. the explicit PPMI matrix
  2. SVD factorization of said matrix
  3. SGNS
  4. GloVe

PPMI Matrix


$PMI(w,c)=\log \frac{\hat{P}(w,c)}{\hat{P}(w)\,\hat{P}(c)}=\log\frac{\#(w,c)\cdot |D|}{\#(w)\cdot \#(c)}$

$PPMI(w,c)=\max(PMI(w,c),0)$

A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infrequent events.
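The PPMI definition above can be sketched directly from a co-occurrence count matrix. This is a minimal illustration, not the paper's implementation; the dense matrix and the function name `ppmi_matrix` are assumptions for the example (real vocabularies would need sparse matrices).

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI from a dense word-context co-occurrence count matrix.

    counts[w, c] = #(w, c); rows are words, columns are contexts.
    """
    total = counts.sum()                                # |D|
    word_counts = counts.sum(axis=1, keepdims=True)     # #(w)
    context_counts = counts.sum(axis=0, keepdims=True)  # #(c)
    # PMI(w,c) = log( #(w,c) * |D| / (#(w) * #(c)) )
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (word_counts * context_counts))
    # PPMI clips negatives (and the -inf of unobserved pairs) to 0
    return np.maximum(pmi, 0.0)

counts = np.array([[10.0, 0.0], [2.0, 8.0]])
M = ppmi_matrix(counts)
```

The clipping step is exactly where the rare-event bias enters: a pair seen once with two rare words can get a very large PMI.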

Transferable Hyperparameters

The paper adapts the hyperparameters of the embedding methods and applies them to the count-based methods. They fall into three groups:

  1. pre-processing hyperparameters
  2. association metric hyperparameters
  3. post-processing hyperparameters

Pre-processing hyperparameters

Dynamic Context Windows (dyn)

Context words can be weighted according to their distance from the focus word: word2vec weights a context word at distance $d$ in a window of size $L$ by $\frac{L-d+1}{L}$, while GloVe uses harmonic weighting, $\frac{1}{d}$.
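The two weighting schemes can be sketched as simple functions of the distance (the function names are illustrative, not from the paper):

```python
def dyn_weight_word2vec(distance, window):
    # word2vec: weight decays linearly with distance from the focus word
    return (window - distance + 1) / window

def dyn_weight_glove(distance):
    # GloVe: harmonic weighting, 1/distance
    return 1.0 / distance
```

For a window of 5, word2vec assigns weights 5/5, 4/5, ..., 1/5 to distances 1 through 5.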

Subsampling

Subsampling is a method of diluting very frequent words, akin to removing stop-words.
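word2vec's subsampling discards a token of corpus frequency $f$ with probability $1-\sqrt{t/f}$, where $t$ is a threshold (e.g. $10^{-5}$). A minimal sketch, assuming a hypothetical `keep_token` helper with an injectable random source for testability:

```python
import math
import random

def keep_token(freq, threshold=1e-5, rng=random.random):
    """word2vec-style subsampling: discard a token of corpus frequency
    `freq` with probability 1 - sqrt(threshold / freq)."""
    if freq <= threshold:
        return True  # sufficiently rare words are always kept
    p_discard = 1.0 - math.sqrt(threshold / freq)
    return rng() >= p_discard
```

A stop-word-like token with frequency 0.05 is discarded about 98.6% of the time, while rare words pass through untouched.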

Deleting Rare Words (del)

Delete rare words before creating context windows.

Association Metric Hyperparameters

  1. Shifted PMI (neg)
  2. Context distribution smoothing (cds)
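Both association-metric hyperparameters modify how PMI is computed: the shift subtracts $\log k$ (mimicking SGNS's $k$ negative samples, $SPPMI(w,c)=\max(PMI(w,c)-\log k,\,0)$), and context distribution smoothing raises context counts to the power 0.75 before normalizing. A sketch combining the two (the function name is an assumption for the example):

```python
import numpy as np

def shifted_smoothed_ppmi(counts, neg=5, cds=0.75):
    """PPMI with context distribution smoothing and a log(neg) shift.

    Smoothing (cds < 1) inflates the probability of rare contexts,
    dampening PMI's bias toward them; the shift mimics SGNS trained
    with `neg` negative samples.
    """
    total = counts.sum()
    word_p = counts.sum(axis=1, keepdims=True) / total
    ctx_smoothed = counts.sum(axis=0) ** cds        # #(c)^cds
    ctx_p = (ctx_smoothed / ctx_smoothed.sum())[None, :]
    joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (word_p * ctx_p))
    return np.maximum(pmi - np.log(neg), 0.0)
```

With neg=1 and cds=1 this reduces to plain PPMI, so both knobs can be tuned independently on top of the count-based model.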

Post-processing hyperparameters

  1. Adding context vector
  2. Eigenvalue Weighting
  3. Vector Normalization: the standard L2 normalization of $W$’s rows is consistently superior.
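Two of the post-processing steps compose naturally: add the context vectors ($w+c$), then L2-normalize the rows so that cosine similarity reduces to a dot product. A minimal sketch, assuming word and context matrices `W` and `C` of the same shape (the helper name is illustrative):

```python
import numpy as np

def postprocess(W, C, add_context=True):
    """Optionally form w+c vectors, then L2-normalize each row."""
    V = W + C if add_context else W
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    # guard against all-zero rows before dividing
    return V / np.where(norms == 0.0, 1.0, norms)
```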

Experiments

Word Similarity

Six datasets:

  1. WordSim-353: divided into two datasets: WordSim Similarity and WordSim Relatedness
  2. MEN dataset
  3. Mechanical Turk dataset
  4. Rare Words dataset
  5. SimLex-999 dataset

Analogy

  1. MSR’s analogy dataset
  2. Google’s analogy dataset

Results

At times, changing hyperparameters can bring a bigger improvement than switching to a different representation method. In some tasks, careful hyperparameter tuning can also outweigh the benefit of adding more data.

SVD is very useful. word2vec outperforms GloVe.

The prediction-based word embeddings are not superior to count-based approaches. The contradictory results in Baroni et al. (2014) stem from creating word2vec embeddings with somewhat pre-tuned hyperparameters (the defaults recommended by word2vec) and comparing them to "vanilla" PPMI and SVD representations.

3CosMul dominates 3CosAdd in every case.

A few works show that CBOW has a slight advantage over the other methods, but the original word2vec paper reports that SGNS performs better.

Hyperparameter Analysis

Harmful Configurations

  1. SVD does not benefit from shifted PPMI
  2. Using SVD “correctly” is bad
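The "correct" SVD factorization sets $W=U_d \Sigma_d$ (eig=1); the paper finds that down-weighting the singular values, eig=0.5 or eig=0, works better for similarity tasks. A sketch of the eig hyperparameter (the helper name is hypothetical):

```python
import numpy as np

def svd_embeddings(M, d=2, eig=0.5):
    """Truncated-SVD word vectors from a PPMI-style matrix.

    eig=1 gives the "correct" factorization W = U_d * Sigma_d, which
    the paper finds harmful; eig=0.5 (symmetric) or eig=0 tends to
    perform better on similarity tasks.
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * (S[:d] ** eig)
```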

Beneficial Configurations

PPMI and SVD prefer shorter context windows (win=2), while SGNS always prefers numerous negative samples (neg>1). The only hyperparameter that can be "blindly" applied in any situation is context distribution smoothing (cds=0.75).

Practical Recommendations

  1. Always use context distribution smoothing to modify PMI.
  2. Do not use SVD "correctly" (eig=1); prefer eig=0.5 or eig=0.
  3. SGNS is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption.
  4. With SGNS, prefer many negative samples.
  5. For both SGNS and GloVe, it is worthwhile to experiment with the $w+c$ variant.